library(tidyverse)
library(knitr)

Body temperature data

A paper, Decreasing human body temperature in the United States since the Industrial Revolution, presented evidence that human body temperatures in the United States have been decreasing over the past one hundred or so years. (Many scientists dispute the conclusions of the paper.) One of the data sets in the paper is taken from the NHANES (National Health and Nutrition Examination Study), and is available in the file NHANES_processed.csv. There are many variables in the data, but our focus will be on the variable temp that provides resting oral body temperatures.

  1. Draw a histogram of the body temperature variable.
nhanes <- read.csv("NHANES_processed.csv",header = TRUE)
names <- read.csv("CensusNames.csv")
nhanes
Temperture <- nhanes$temp
hist(Temperture)

The empirical cumulative distribution function (ecdf) gives the proportion of data values at or below a particular value. In the case of the body temperature data, \(\text{ecdf}(98.1)\), for example, gives the proportion of temperatures in the data set below \(98.1\) degrees. A few minutes of thought reveals that

  • \(\text{ecdf}(x) = 0\) if \(x\) is less than the minimum temperature in the data set;
  • \(\text{ecdf}(x) = 1\) if \(x\) is greater than or equal to the maximum temperature in the data set;
  • \(\text{ecdf}(x)\) is a non-decreasing function of \(x\).
  1. Use the stat_ecdf function in ggplot2 to draw an “empirical cumulative distribution function” for the body temperature variable.
ggplot(nhanes,aes(Temperture))+
  stat_ecdf(geom = "step")

Look at the plot of the ecdf to gain more insight into the behavior of the ecdf.

  1. Plot vertical lines in the ecdf plot for the estimate of the 25th, 50th, and 75th percentiles of the temperature data and estimate the values. Hint: use the geom_vline() function with an xintercept argument.
ggplot(nhanes,aes(Temperture)) +
  stat_ecdf(geom = "step") +
  geom_vline(xintercept=98) + geom_vline(xintercept=98.4) + geom_vline(xintercept=98.6)

How do these compare with the percentiles computed via the summary() function?

summary(nhanes)
       X               V1        study_ID         sample_weights  
 Min.   :    1   Min.   :Inf   Length:15301       Min.   :     0  
 1st Qu.: 6467   1st Qu.:Inf   Class :character   1st Qu.:  1288  
 Median :12884   Median :Inf   Mode  :character   Median :  4752  
 Mean   :12641   Mean   :Inf                      Mean   :  7536  
 3rd Qu.:19117   3rd Qu.:Inf                      3rd Qu.: 10381  
 Max.   :23808   Max.   :Inf                      Max.   :100990  
                                                                  
      temp           time_HR          race               sex           
 Min.   : 95.20   Min.   : 0.00   Length:15301       Length:15301      
 1st Qu.: 98.00   1st Qu.:10.93   Class :character   Class :character  
 Median : 98.40   Median :14.67   Mode  :character   Mode  :character  
 Mean   : 98.31   Mean   :14.42                                        
 3rd Qu.: 98.60   3rd Qu.:16.57                                        
 Max.   :101.80   Max.   :23.97                                        
                  NA's   :8769                                         
      age        year_of_birth   exam_date           exam_year   
 Min.   :21.00   Min.   :1896   Length:15301       Min.   :1971  
 1st Qu.:31.00   1st Qu.:1908   Class :character   1st Qu.:1972  
 Median :45.00   Median :1928   Mode  :character   Median :1973  
 Mean   :46.55   Mean   :1926                      Mean   :1973  
 3rd Qu.:64.00   3rd Qu.:1941                      3rd Qu.:1974  
 Max.   :75.00   Max.   :1953                      Max.   :1975  
                                                                 
   exammonth      exam_findings      exam_ICD      exam_ICD2   
 Min.   : 1.000   Min.   :0.000   Min.   :   9   Min.   :  11  
 1st Qu.: 4.000   1st Qu.:2.000   1st Qu.: 528   1st Qu.:9999  
 Median : 7.000   Median :2.000   Median :9999   Median :9999  
 Mean   : 6.506   Mean   :1.626   Mean   :6148   Mean   :8273  
 3rd Qu.: 9.000   3rd Qu.:2.000   3rd Qu.:9999   3rd Qu.:9999  
 Max.   :12.000   Max.   :8.000   Max.   :9999   Max.   :9999  
                                                               
    region          birth_cohort       head_eyes_ears_nose_findings
 Length:15301       Length:15301       Mode :logical               
 Class :character   Class :character   FALSE:4456                  
 Mode  :character   Mode  :character   TRUE :10845                 
                                                                   
                                                                   
                                                                   
                                                                   
 thyroid_findings chest_findings  cardiovascular_findings abdominal_findings
 Mode :logical    Mode :logical   Mode :logical           Mode :logical     
 FALSE:340        FALSE:3868      FALSE:2374              FALSE:6255        
 TRUE :14961      TRUE :11433     TRUE :12927             TRUE :9046        
                                                                            
                                                                            
                                                                            
                                                                            
 musculosceletal_findings neurological_findings skin_findings  
 Mode :logical            Mode :logical         Mode :logical  
 FALSE:2209               FALSE:1770            FALSE:2143     
 TRUE :13092              TRUE :13531           TRUE :13158    
                                                               
                                                               
                                                               
                                                               
 general_findings no_findings        thyroid        weight_KG     
 Mode :logical    Mode :logical   Min.   :0.000   Min.   : 32.32  
 FALSE:3046       FALSE:12136     1st Qu.:1.000   1st Qu.: 58.97  
 TRUE :12255      TRUE :3165      Median :1.000   Median : 68.49  
                                  Mean   :1.078   Mean   : 70.44  
                                  3rd Qu.:1.000   3rd Qu.: 79.72  
                                  Max.   :8.000   Max.   :181.44  
                                                                  
   height_CM          BMI            temp_C         bmi_adj        
 Min.   :132.8   Min.   :12.41   Min.   :35.11   Min.   :-12.8240  
 1st Qu.:159.6   1st Qu.:21.84   1st Qu.:36.67   1st Qu.: -2.2594  
 Median :165.6   Median :24.68   Median :36.89   Median :  0.5705  
 Mean   :166.3   Mean   :25.42   Mean   :36.84   Mean   :  1.3745  
 3rd Qu.:172.6   3rd Qu.:28.00   3rd Qu.:37.00   3rd Qu.:  3.9716  
 Max.   :205.1   Max.   :72.26   Max.   :38.78   Max.   : 48.9415  
                                                                   
  height_norm       weight_norm       
 Min.   :-22.252   Min.   :-26.59393  
 1st Qu.:  4.458   1st Qu.:  0.05607  
 Median : 10.458   Median :  9.57607  
 Mean   : 11.204   Mean   : 11.52316  
 3rd Qu.: 17.458   3rd Qu.: 20.80607  
 Max.   : 49.978   Max.   :122.52607  
                                      

The 20th,50th,75th line up pretty good compared to the plots vs summary

  1. How does the mean temperature compare to the median temperature? What is the comparison telling you about the distribution of temperature?

They are both very close 98.40 and 98.31 close, tells us that the distribution of data is very close in range

Name data

The file CensusNames.csv contains data on surnames in the United States, including the name, the rank of the name among all names, and the number of people in the United States with that name. (Some uncommon names are not included in the data.)

  1. Is your surname in the list? If so, what is its rank, and how many people in the United States have that name?
name <- names %>%
  filter(name == "DOVER") %>%
  group_by(count)
  1. Draw a histogram and an ecdf plot of the variable which counts the number of people with a name. What do you notice from these plots?
ggplot(name, aes(x=count)) +
  geom_histogram()

ggplot(name,aes(count))+
  stat_ecdf(geom = "step")

They dint really give much information to draw a conclusion about the information

  1. Draw a histogram and an ecdf plot of the base 10 logarithm of the counts. The R function log10() computes base 10 logarithms.
ggplot(name, aes(x=log10(count))) +
  geom_histogram()

ggplot(name,aes(log10(count)))+
  stat_ecdf(geom = "step")

  1. From the ecdf plot of the logarithm of the counts, estimate the 25th, 50th, and 75th percentiles of the counts. Compare these to the computed percentiles from the summary() function. Are they similar?
summary(name$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9742    9742    9742    9742    9742    9742 

Maybe my plots are wrong but they dont really comapre

  1. How does the mean count compare to the median count? How does the mean count compare to the 75th percentile of the counts?

The numbers are all the same.

US cities data

This section is similar to the previous one.

The file UnitedStatesCities.csv contains data on cities in the United States, including the name of the city, the population rank of the city among all cities, and the number of people living in the city in 2010.

  1. What are the population and rank of East Lansing?
city <- read.csv("UnitedStatesCities.csv")
city
city %>%
  filter(City == "Houston city, Texas") %>%
  select(Rank, Population2010)
  1. Draw a histogram and an ecdf plot of the populations. Do these data seem more like the data on body temperatures or the data on the count of people with a particular name?
ggplot(city, aes(x = Population2010)) +
  geom_histogram(
    binwidth = 10000,
    fill     = "steelblue",
    color    = "black"
  ) +
  labs(
    title = "Histogram of U.S. City Populations",
    x     = "Population",
    y     = "Count"
  ) +
  theme_minimal()

ggplot(city, aes(x = Population2010)) +
  stat_ecdf(geom = "step") +
  labs(
    title = "ECDF of U.S. City Populations",
    x     = "Population",
    y     = "ECDF"
  ) +
  theme_minimal()

  1. Draw a histogram and an ecdf plot of the base-10 logarithm of the populations.
city <- city %>% 
  mutate(logpop = log10(Population2010))

# Histogram of log10(population)
ggplot(city, aes(x = logpop)) +
  geom_histogram(binwidth = 0.1, fill = "darkcyan", color = "black") +
  labs(title = "Histogram of log10(Population)",
       x     = "log10(Population)",
       y     = "Count") +
  theme_minimal()

# ECDF of log10(population)
ggplot(city, aes(x = logpop)) +
  stat_ecdf(geom = "step") +
  labs(title = "ECDF of log10(Population)",
       x     = "log10(Population)",
       y     = "ECDF") +
  theme_minimal()

  1. From the ecdf plot of the logarithm of the populations, estimate the 25th, 50th, and 75th percentiles of the populations. Compare these to the computed percentiles from the summary() function. Are they similar?
summary(city$logpop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   2.567   3.059   3.138   3.660   6.912 
# Extract exactly the 1st, 2nd, and 3rd quartiles via quantile()
quantile(city$logpop, probs = c(0.25, 0.5, 0.75))
     25%      50%      75% 
2.567026 3.059185 3.660439